The Flexible Group Spatial Keyword Query
We present a new class of service for location based social networks, called
the Flexible Group Spatial Keyword Query, which enables a group of users to
collectively find a point of interest (POI) that optimizes an aggregate cost
function combining both spatial distances and keyword similarities. In
addition, our query service allows users to consider the tradeoffs between
obtaining a sub-optimal solution for the entire group and obtaining an
optimized solution but only for a subgroup.
We propose algorithms to process three variants of the query: (i) the group
nearest neighbor with keywords query, which finds a POI that optimizes the
aggregate cost function for the whole group of size n, (ii) the subgroup
nearest neighbor with keywords query, which finds the optimal subgroup and a
POI that optimizes the aggregate cost function for a given subgroup size m (m
<= n), and (iii) the multiple subgroup nearest neighbor with keywords query,
which finds optimal subgroups and corresponding POIs for each of the subgroup
sizes in the range [m, n]. We design query processing algorithms based on
branch-and-bound and best-first paradigms. Finally, we provide theoretical
bounds and conduct extensive experiments with two real datasets which verify
the effectiveness and efficiency of the proposed algorithms. Comment: 12 page
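The three query variants above can be illustrated with a brute-force baseline. This is a minimal sketch, not the paper's branch-and-bound or best-first algorithms: the cost function (a weighted sum of Euclidean distance and Jaccard keyword dissimilarity), the `alpha` weight, the sum aggregate, and all names are assumptions made for illustration. Under a sum aggregate, the best subgroup of size m for a fixed POI is simply the m users with the lowest individual cost.

```python
import math

def jaccard(a, b):
    """Jaccard similarity between two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def user_cost(user, poi, alpha=0.5):
    # Hypothetical cost: weighted sum of spatial distance and keyword dissimilarity.
    distance = math.dist(user["loc"], poi["loc"])
    return alpha * distance + (1 - alpha) * (1.0 - jaccard(user["kw"], poi["kw"]))

def best_poi_for_subgroup(users, pois, m, alpha=0.5):
    """Return (total_cost, poi_name, member_indices) for the best subgroup of size m.

    With m == len(users) this is the whole-group variant (i); with m < len(users)
    it is the subgroup variant (ii).
    """
    best = None
    for poi in pois:
        costs = sorted((user_cost(u, poi, alpha), i) for i, u in enumerate(users))
        chosen = costs[:m]  # the m cheapest users minimize a sum aggregate
        total = sum(c for c, _ in chosen)
        if best is None or total < best[0]:
            best = (total, poi["name"], sorted(i for _, i in chosen))
    return best

def best_poi_for_all_sizes(users, pois, m, alpha=0.5):
    """Variant (iii): one answer per subgroup size in the range [m, n]."""
    return {k: best_poi_for_subgroup(users, pois, k, alpha)
            for k in range(m, len(users) + 1)}

users = [
    {"loc": (0.0, 0.0), "kw": {"cafe"}},
    {"loc": (1.0, 0.0), "kw": {"cafe"}},
    {"loc": (10.0, 10.0), "kw": {"bar"}},
]
pois = [
    {"name": "A", "loc": (0.5, 0.0), "kw": {"cafe"}},
    {"name": "B", "loc": (10.0, 10.0), "kw": {"bar"}},
]
answers = best_poi_for_all_sizes(users, pois, m=1)
```

The baseline scans every POI for every subgroup size, which is exactly the work that index-based pruning is meant to avoid.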
Bulk Insertions into xBR+-trees
Bulk insertion refers to the process of updating an existing index by inserting a large batch of new data, treating the items of this batch as a whole and not by inserting these items one-by-one. Bulk insertion is related to bulk loading, which refers to the process of creating a new index from scratch, when the dataset to be indexed is available beforehand. The xBR+-tree is a balanced, disk-resident, Quadtree-based index for point data, which is very efficient for processing spatial queries. In this paper, we present the first algorithm for bulk insertion into xBR+-trees. This algorithm incorporates extensions of techniques that we have recently developed for bulk loading xBR+-trees. Moreover, using real and artificial datasets of various cardinalities, we present an experimental comparison of this algorithm vs. inserting items one-by-one for updating xBR+-trees, regarding performance (I/O and execution time) and the characteristics of the resulting trees. We also present experimental results regarding the query-processing efficiency of xBR+-trees built by bulk insertions vs. xBR+-trees built by inserting items one-by-one.
Accurate and Fast Retrieval for Complex Non-metric Data via Neighborhood Graphs
We demonstrate that a graph-based search algorithm, relying on the
construction of an approximate neighborhood graph, can directly work with
challenging non-metric and/or non-symmetric distances without resorting to
metric-space mapping and/or distance symmetrization, which, in turn, lead to
substantial performance degradation. Although straightforward metrization
and symmetrization are usually ineffective, we find that constructing an index
using a modified, e.g., symmetrized, distance can improve performance. This
observation paves the way for a new line of research on designing
index-specific graph-construction distance functions.
Using metric space indexing for complete and efficient record linkage
Record linkage is the process of identifying records that refer to the same real-world entities in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on their similarity. Linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing, then pairs within the same group are compared in more detail, and finally classified. Even state-of-the-art indexing techniques, such as locality sensitive hashing, have potential drawbacks. They may fail to group together some true matching records with high similarity, or they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing (MSI) to perform complete linkage, resulting in a parameter-free process combining indexing, comparison and classification into a single step delivering complete and efficient record linkage. An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process.
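To make the idea of complete linkage over a metric space concrete, here is a minimal sketch using a generic pivot table and the triangle inequality. The abstract does not say which metric index the authors use, so the pivot-table structure, the choice of Levenshtein distance, the radius, and all names below are assumptions for illustration. Because the distance is a true metric, pruning never discards a record within the radius, so the result matches an exhaustive scan.

```python
def levenshtein(a, b):
    """Edit distance between two strings (a metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def build_pivot_table(records, pivots, dist):
    """Precompute the distance from every record to every pivot."""
    return [[dist(r, p) for p in pivots] for r in records]

def range_query(query, radius, records, pivots, table, dist):
    """Return all records within `radius` of `query`, with no false dismissals."""
    query_to_pivots = [dist(query, p) for p in pivots]
    hits = []
    for record, row in zip(records, table):
        # Triangle inequality: |d(q, p) - d(r, p)| <= d(q, r).
        # If this lower bound exceeds the radius, the record cannot match.
        if any(abs(qd - rd) > radius for qd, rd in zip(query_to_pivots, row)):
            continue
        if dist(query, record) <= radius:
            hits.append(record)
    return hits

names = ["john smith", "jon smith", "jane smyth", "mary jones", "maria jones"]
pivots = names[:2]  # hypothetical pivot choice
table = build_pivot_table(names, pivots, levenshtein)
hits = range_query("john smyth", 2, names, pivots, table, levenshtein)
brute_force = [n for n in names if levenshtein("john smyth", n) <= 2]
```

The pivot distances are computed once per dataset, so each query pays for a few pivot comparisons plus full comparisons only on records that survive the bound.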
Anonymisation of geographical distance matrices via Lipschitz embedding
BACKGROUND: Anonymisation of spatially referenced data has received increasing attention in recent years. Whereas the research focus has been on the anonymisation of point locations, the disclosure risk arising from the publishing of inter-point distances and corresponding anonymisation methods have not been studied systematically.
METHODS: We propose a new anonymisation method for the release of geographical distances between records of a microdata file, for example patients in a medical database. We discuss a data release scheme in which microdata without coordinates and an additional distance matrix between the corresponding rows of the microdata set are released. In contrast to most other approaches this method preserves small distances better than larger distances. The distances are modified by a variant of Lipschitz embedding.
RESULTS: The effects of the embedding parameters on the risk of data disclosure are evaluated by linkage experiments using simulated data. The results indicate small disclosure risks for appropriate embedding parameters.
CONCLUSION: The proposed method is useful if published distance information might be misused for the re-identification of records. The method can be used for publishing scientific-use files and as an additional tool for record-linkage studies.
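For readers unfamiliar with the technique, the following is a minimal sketch of a basic Lipschitz embedding, not the specific variant evaluated in the paper. Each point is mapped to its distances to a few randomly chosen reference sets, and the released matrix is computed in the embedded space. The number and size of reference sets, the seed, the use of the maximum norm, and all function names are assumptions for illustration.

```python
import math
import random

def lipschitz_embed(points, num_sets=4, set_size=2, seed=0):
    """Map each point to its distances from randomly chosen reference sets."""
    rng = random.Random(seed)
    ref_sets = [rng.sample(points, set_size) for _ in range(num_sets)]
    # Coordinate i of a point is its distance to the nearest member of set i.
    return [[min(math.dist(p, a) for a in ref) for ref in ref_sets]
            for p in points]

def chebyshev(u, v):
    """Maximum-norm distance between two embedded vectors."""
    return max(abs(x - y) for x, y in zip(u, v))

def embedded_distance_matrix(points, **kwargs):
    """Distance matrix computed in the embedded space instead of on coordinates."""
    vectors = lipschitz_embed(points, **kwargs)
    return [[chebyshev(u, v) for v in vectors] for u in vectors]

points = [(0.0, 0.0), (1.0, 0.5), (2.0, 2.0), (5.0, 1.0), (6.0, 6.0), (9.0, 3.0)]
released = embedded_distance_matrix(points, num_sets=4, set_size=2, seed=0)
original = [[math.dist(p, q) for q in points] for p in points]
```

Each coordinate is 1-Lipschitz, so under the maximum norm an embedded distance never exceeds the original distance; how much it shrinks depends on the reference sets, which is the kind of parameter effect the RESULTS paragraph refers to.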
Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches
We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches combined five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models: BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE. PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
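As an illustration of one of the five techniques, the sketch below computes tf-idf cosine similarities and keeps only the top-n highest similarities per document, as the abstract describes. It is a toy version under stated assumptions: whitespace tokenization, raw term counts, an unsmoothed logarithmic idf, and the function names are choices made here, not details taken from the study.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (term -> weight) from raw text."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / doc_freq[t])
                        for t, c in term_freq.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_n_similarities(docs, n=1):
    """For each document, keep its n most similar other documents."""
    vectors = tfidf_vectors(docs)
    result = {}
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        sims.sort(key=lambda s: (-s[0], s[1]))  # highest similarity first
        result[i] = sims[:n]
    return result

docs = [
    "gene expression in cancer cells",
    "cancer gene therapy trial",
    "graph layout algorithm for clustering",
]
neighbours = top_n_similarities(docs, n=1)
```

Comparing every pair of documents is quadratic, so a corpus of the size studied would need sparse matrix operations or an inverted index rather than this direct loop.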